Let’s make data maps! This is a huge topic, once again, so we’ll keep
things relatively simple for the purposes of this course. There are
plenty of resources online if you’d like to dig further into plotting
spatial data. But, now armed with your ggplot powers, you
will be able to make some compelling and informative maps.
Many public datasets contain latitude and longitude (or other coordinate) columns, where each observation (row) has an exact geographic location associated with it. It can be straightforward to visualize these data on a map, but first you’ll need a map/geographic data source.
Some of the big map sources are Google Maps, and the Open Street Framework. Often, you’ll need to request an API key to use their maps, which is not difficult but it requires making an account, requesting a key, etc., and there are plenty of guides for how to do this online.
Stadia Maps is an
easy-to-access resource for map data that pairs nicely with the
ggmap library for R. It does require an API key, which
involves setting up a free account. I’ve set up an account for our class
already, and have included the API key in the code chunk below. Go ahead
and run the code chunk to load up ggmap (you may need to
install it first) and ggplot2, and connect to the Stadia
Maps API.
library(ggmap)
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## i Google's Terms of Service: <https://mapsplatform.google.com>
## Stadia Maps' Terms of Service: <https://stadiamaps.com/terms-of-service/>
## OpenStreetMap's Tile Usage Policy: <https://operations.osmfoundation.org/policies/tiles/>
## i Please cite ggmap if you use it! Use `citation("ggmap")` for details.
library(ggplot2) # should load in with ggmap automatically, but no harm in doing it explicitly
key = 'e7f97169-8de3-4e18-9683-5e08ee8affa3'
register_stadiamaps(key)
The key differences between ggmap and
ggplot are: we use ggmap() to iniate the plot,
and instead of a dataframe as a first argument, we pass in a map. With
Stadia Maps, this comes from the get_stadiamap() function.
We’ll then link our actual dataframe of interest to the map using
aes() in a separate command (in the geom -
later).
By default, get_stadiamap() returns a map of Houston, TX
with a particular map style. Run ?get_stadiamap at the
Console to see a list of different map styles you can choose from
(though the documentation appears to be incomplete). Play around
with different arguments to maptype = and find one you like
- I am a fan of stamen_toner_lite.
ggmap(get_stadiamap(maptype = 'alidade_smooth_dark'))
## i © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
Before we move forward, let’s load in a dataset that contains spatial
coordinate data. For this Part, you can use the
arrests dataset we used last time, OR you can use a dataset
of your choice. If you want to use your own data, all I ask is
that you make an informative, clean, good-looking map and that you write
2-3 sentences describing what the map shows. You can use the code chunks
below as inspiration, but feel free to reorganize the rest of this
section as it makes sense to you. Otherwise, follow along.
These map sources (including Stadia Maps) grant you access to a map
of the entire planet. So, let’s limit our visualization range to that of
the data. get_stadiamap() includes an argument for a
“bounding box”. Think of this as a window into the geographic area we
want to look at. For example, in the plot above, the bounding box ranges
from -95.8 to about -94.9 degrees longitude, and 29.4 to about 30.1
degrees latitude.
# Use the arrests data provided, or your own dataset
arrests = read.csv('arrests.csv')
We could find a bounding box for our region manually, but perhaps a
better way is to let the data inform us. Using max() and
min(), complete the bounding box code below to determine
the range of geographical coordinates we should draw for the NYPD
arrests dataset.
# Make sure each line except the last one ends with a comma,
bbox = c(
bottom = min(arrests$Latitude),
top = max(arrests$Latitude),
left = min(arrests$Longitude),
right = max(arrests$Longitude)
)
# Call 'bbox' by name to check that you stored the coordinates
bbox
## bottom top left right
## 40.49940 40.91272 -74.25184 -73.70161
Now you can pass in bbox as the first argument to
get_stadiamap() and it will return a map bounded by those
coordinates. There is also a zoom = argument that defines
the resolution of the map. Run ggmap(get_stadiamap())
giving arguments bbox and a zoom of 11, with a
maptype that you like. Assign the output to a variable
called map.
While you’re at it, get rid of the ugly background grid by setting
the panel.background to a blank element with
element_blank() using theme(). The
latitude/longitude exact degrees are not adding anything either, so get
rid of them, too, in the same call to theme by setting
axis.ticks, axis.text, and
axis.title to element_blank(). An informative
title would be good, though, so add one using
labs(title = 'your title'). Note that the plot title is
different from the axis titles (‘Latitude’ and ‘Longitude’)
# Complete the code here
map = ggmap(get_stadiamap(bbox, zoom = 11, maptype = 'stamen_toner_lite')) +
theme(panel.background = element_blank(),
axis.ticks = element_blank(),
axis.text = element_blank()) +
labs(title = 'Arrests in NYC')
## i © Stadia Maps © Stamen Design © OpenMapTiles © OpenStreetMap contributors.
# Call 'map' by name to see the plot
map
In the Lab, we explored a few ways to plot densities on a map using
geom_point(), geom_density_2d(), and
geom_density_2d_filled(). Run and study the code below to
see how to add this type of geom to your map. Notice that we’re giving a
dataframe as well as an aes() mapping to the geom
specifically. The argument alpha = controls the
transparency of the geom and ranges from 0 (fully transparent) to 1
(fully opaque). We’re also using logical indexing to
extract the rows of arrests that only contain offenses
(OFNS_DESC) related to dangerous weapons.
offense_type = 'DANGEROUS WEAPONS'
map +
geom_density_2d_filled(data = arrests[arrests$OFNS_DESC == offense_type,],
aes(x = Longitude, y = Latitude),
alpha = 0.4)
Do you see any “hotspots” where arrests are likely to be made for dangerous weapons?
— Your description of the map here —
What are the possible values that the OFNS_DESC variable
can take? The unique() function is very useful when
exploring a dataset. Run the unique() function below, passing in
the OFNS_DESC column using the $
operator.
unique(arrests$OFNS_DESC)
## [1] "RAPE"
## [2] "ARSON"
## [3] "SEX CRIMES"
## [4] ""
## [5] "ASSAULT 3 & RELATED OFFENSES"
## [6] "FELONY ASSAULT"
## [7] "DANGEROUS WEAPONS"
## [8] "MISCELLANEOUS PENAL LAW"
## [9] "PETIT LARCENY"
## [10] "GRAND LARCENY OF MOTOR VEHICLE"
## [11] "DANGEROUS DRUGS"
## [12] "NYS LAWS-UNCLASSIFIED FELONY"
## [13] "GRAND LARCENY"
## [14] "OTHER OFFENSES RELATED TO THEF"
## [15] "PROSTITUTION & RELATED OFFENSES"
## [16] "OFFENSES AGAINST PUBLIC ADMINI"
## [17] "ROBBERY"
## [18] "FOR OTHER AUTHORITIES"
## [19] "CRIMINAL MISCHIEF & RELATED OF"
## [20] "OTHER TRAFFIC INFRACTION"
## [21] "FORGERY"
## [22] "INTOXICATED & IMPAIRED DRIVING"
## [23] "MURDER & NON-NEGL. MANSLAUGHTE"
## [24] "UNAUTHORIZED USE OF A VEHICLE"
## [25] "BURGLARY"
## [26] "OFFENSES INVOLVING FRAUD"
## [27] "VEHICLE AND TRAFFIC LAWS"
## [28] "OFF. AGNST PUB ORD SENSBLTY &"
## [29] "THEFT OF SERVICES"
## [30] "CRIMINAL TRESPASS"
## [31] "POSSESSION OF STOLEN PROPERTY"
## [32] "OTHER STATE LAWS (NON PENAL LA"
## [33] "FRAUDULENT ACCOSTING"
## [34] "OFFENSES AGAINST THE PERSON"
## [35] "FRAUDS"
## [36] "OTHER STATE LAWS"
## [37] "THEFT-FRAUD"
## [38] "INTOXICATED/IMPAIRED DRIVING"
## [39] "GAMBLING"
## [40] "HOMICIDE-NEGLIGENT-VEHICLE"
## [41] "ADMINISTRATIVE CODE"
## [42] "BURGLAR'S TOOLS"
## [43] "ALCOHOLIC BEVERAGE CONTROL LAW"
## [44] "ENDAN WELFARE INCOMP"
## [45] "PARKING OFFENSES"
## [46] "OFFENSES AGAINST PUBLIC SAFETY"
## [47] "JOSTLING"
## [48] "MOVING INFRACTIONS"
## [49] "KIDNAPPING & RELATED OFFENSES"
## [50] "KIDNAPPING"
## [51] "HOMICIDE-NEGLIGENT,UNCLASSIFIE"
## [52] "AGRICULTURE & MRKTS LAW-UNCLASSIFIED"
## [53] "OFFENSES RELATED TO CHILDREN"
## [54] "ANTICIPATORY OFFENSES"
## [55] "HARRASSMENT 2"
## [56] "DISORDERLY CONDUCT"
## [57] "OTHER STATE LAWS (NON PENAL LAW)"
## [58] "CHILD ABANDONMENT/NON SUPPORT"
## [59] "ESCAPE 3"
## [60] "LOITERING/GAMBLING (CARDS, DIC"
## [61] "KIDNAPPING AND RELATED OFFENSES"
## [62] "ADMINISTRATIVE CODES"
## [63] "FELONY SEX CRIMES"
## [64] "NEW YORK CITY HEALTH CODE"
Copy+paste the code chunk above containing
geom_density_2d_filled, and change the ‘DANGEROUS WEAPONS’
condition to a different type of offense. Try it with more than one type
of geom, and play with the geoms options (like
alpha, linewidth), or change the color palette
(e.g. ) - to get a data map that you’re happy with. Describe
the resulting map - is it similar or different?
# Store the offense type value in a variable so it's easy to reuse with multiple geoms
offense_type = 'KIDNAPPING'
# Plotting code goes here
map +
geom_density_2d_filled(data = arrests[arrests$OFNS_DESC == offense_type,],
aes(x = Longitude, y = Latitude),
alpha = 0.4)
— Your description of the map here —
So far, all of our plots have shown up in the Plots panel, or as
output in the R Markdown document. Eventually you’ll want to get figures
out of RStudio and into a file that you can place in a Word document or
a Powerpoint/Google slide, for example. Since we didn’t get to cover it
in lab, I’ve written some code below that will export any R plot to a
file on your computer. There is a function ggsave() for
ggplots specifically, but the method below will work for any plot.
The steps are somewhat un-intuitive at first, but here’s how they go:
(1) create an empty image file with a function specific to the file type
- this turns on the “graphics device” (2) render your plot with
print(my_plot) - this writes the image to the file (3) turn
off the “graphics device”
# Run getwd() below, which stands for "get working directory", this isn't necessary
# but it can be handy to know exactly where your figure file is going to be saved
getwd()
## [1] "Z:/je2527/Classes/2024_SP ^TA^ Lab in Justice Data Science/Lab 5 - 2.21/assignment"
# Run the line below, which creates a blank PNG image
png('test_image.png', units = 'in', width = 5, height = 4, res = 100)
Look in the folder on your computer indicated by getwd()
and you should have a blank image called
test_image.png.
Now, let’s actually export a plot. First, though, we need a plot
stored in the environment. Copy+paste your final map code from Part 1
above, and store the plot into a new variable called
my_map.
my_map = map +
geom_density_2d_filled(data = arrests[arrests$OFNS_DESC == offense_type,],
aes(x = Longitude, y = Latitude),
alpha = 0.4)
Run the code chunk below to save your plot to both .png (raster) and .pdf (vector) files. NOTE: this won’t work if you run line-by-line; so run the whole chunk at once by clicking the green arrow or clicking anywhere inside the chunk and hitting Shift+Cmd+Return (Mac) or Shift+Ctrl+Enter (Windows)
# Open a "graphics device" - here let's to save to a .png
png('my_map_figure.png', units = 'in', width = 5, height = 4, res = 100)
# This displays the ggmap you just made
print(my_map)
# This is necessary to save the plot to the file (don't ask me why)
# it seems you have to run the code chunk all at once for this to work
# (can't run line-by-line)
dev.off()
## png
## 2
# Now save to a pdf
pdf('my_map_figure.pdf', width = 5, height = 4)
print(my_map)
dev.off()
## png
## 2
Take a moment to view both image files on your computer. Try zooming in on both images. Does any part of the image get pixelated in either image as you zoom? What about text (like the plot title)? Are you able to click and highlight text in the .pdf file?
– Your observations on file type differences here –
The year is 1854. You have just arrived in Soho, London after
receiving a distress signal from Dr. John Snow, leader
of the “Guardians of Inter-dimensional Data”. Snow has summoned you to
assist with a deadly disease outbreak that threatens to rewrite the
course of history on Earth. Cholera, which causes diarrhea and
life-threatening dehydration, has reached epidemic proportions in
London. Snow’s hypothesis is that deaths from cholera are
linked to the neighborhood’s water supply, which is provided by a number
of pumps scattered throughout the area. After your
briefing, Snow gives you his data core containing the number and
locations of deaths, the locations of the
pumps, and a map of Soho. These files seem to
be in an alien format, but luckily, before traveling through the time
portal to 1854 with your laptop, you installed the R packages
sf and raster by running
install.packages(c('sf','raster')) in the console.
Snow tells you that the data are contained within
shapefiles, an ancient technology that is poorly
understood. However, special functions in the sf and
raster packages can read these files and even visualize
them with proper equipment. He instructs you to run the lines below.
library(sf)
## Warning: package 'sf' was built under R version 4.1.3
## Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 7.2.1; sf_use_s2() is TRUE
library(ggplot2)
library(raster)
## Warning: package 'raster' was built under R version 4.1.3
## Loading required package: sp
deaths = st_read('SnowData/Cholera_deaths.shp')
## Reading layer `Cholera_deaths' from data source
## `Z:\je2527\Classes\2024_SP ^TA^ Lab in Justice Data Science\Lab 5 - 2.21\assignment\SnowData\Cholera_deaths.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 250 features and 2 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 529160.3 ymin: 180857.9 xmax: 529655.9 ymax: 181306.2
## Projected CRS: OSGB 1936 / British National Grid
pumps = st_read('SnowData/Pumps.shp')
## Reading layer `Pumps' from data source
## `Z:\je2527\Classes\2024_SP ^TA^ Lab in Justice Data Science\Lab 5 - 2.21\assignment\SnowData\Pumps.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 8 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 529183.7 ymin: 180660.5 xmax: 529748.9 ymax: 181193.7
## Projected CRS: OSGB 1936 / British National Grid
map = raster('SnowData/SnowMap.tif')
map = as.data.frame(map, xy = TRUE)
Snow then connects your laptop to a holographic visualizer, and while
doing so, describes a special geom called
geom_sf() that works nicely with data contained in
shapefiles. To your surprise, he tells you that linking the
data to X and Y coordinates using aes() is not necessary,
as geom_sf() interprets the geometry columns
of shapefiles as coordinates. He encourages you to look at
the structure of deaths and pumps by clicking
on them in your Environment, and then to run the code chunk below.
ggplot() +
geom_sf(data = deaths, aes(size = Count)) +
geom_sf(data = pumps, shape = 21, size = 4, color = 'black', fill = '#35cc97')
Inquisitively, you turn to Snow, then point to the strange code
fill = '#35cc97' above. He calls this representation of
color a hex code and tells you how to obtain them. You
connect to Inter-dimensional Wi-fi and Google “HTML Color Picker”. An
interactive panel appears, with a “HEX” box below containing a #
followed by a 6-letter/digit code, which you can copy/paste into
RStudio. You then point to the part that says shape = 21.
Snow describes how “point” geoms, or “markers”, have many
different available shapes, which are identified by a
number. You Google “R Marker Shapes” and find that the numbers 0-25
represent different shapes such as circles, squares, and
triangles. Some of them have color and fill
properties, some of them only color. Play around
with HEX colors, marker shapes, and sizes to customize the plot above.
NOTE: none of these arguments should be inside
aes()!
Snow, getting frustrated with the amount of time you’re spending
beautifying this plot, reminds you of the reason you are here - to find
the source of the disease outbreak!! Looking at your plot (which is
beautiful), you and Snow are both convinced that the pump
at the center of the deaths cluster must be related to the
outbreak. You tell Snow that surely this figure is evidence enough for
the Inter-dimensional Council to take action. He refuses, exclaiming
that the Council will not be persuaded unless you
quantitatively show that this is the correct
pump. He rubs his forehead and mutters that there must be
some way to quantify which pump is central to the
outbreak.
Eureka! You have an idea: because these data are spatial - you could
find the pump that minimizes the total (summed) distance to
each of the recorded deaths! Snow’s eyes widen. He turns a
hopeful glance towards you.
You show Snow how you can easily find the distance between a single
death and a single pump with the code
below:
st_distance(deaths[1,'geometry'], pumps[1,'geometry'])
## Units: [m]
## [,1]
## [1,] 88.02289
You then show Snow how to find
death-pump distances for multiple locations by
changing deaths[1,'geometry'] to
deaths[1:5,'geometry']. Try this, then try leaving the “row
position” blank: deaths[,'geometry']. Tell me what the
output shows you.
— Your description here —
Snow jumps up and down in excitement. You describe to him an
algorithm for finding the pump that is
nearest to all deaths: 1. initiate a list of
distances starting at 0 for each pump 2. for
each pump i: + 1. compute the distance from
pump i to each row of deaths + 2.
store the sum of those distances in the ith element of
distances 3. the pump corresponding to the
minimum of distances is the central pump
(Jacob here: see if you can implement this algorithm. There are many
other ways to go about this - feel free to experiment.
distances should be a numeric vector of length 8. It should
be all 0’s at first, then all large numbers at the end. The functions
nrow(), numeric(), sum(), and
st_distance() may be useful. The for loop
isn’t necessary but may be useful. If you get stuck, check the
A5hints.txt file on Courseworks)
# Compute the summed distances from each pump to all death locations
n_pumps = nrow(pumps)
distances = numeric(n_pumps)
for (i in 1:n_pumps) {
pump_dist = st_distance(deaths[,'geometry'], pumps[i,'geometry'])
distances[i] = sum(pump_dist)
}
# Show the list of summed distances
distances
## [1] 31509.16 61129.79 75082.31 91563.54 66150.23 59459.83 107222.49
## [8] 71052.50
What is the information contained in the variable
distances?
— Your description here —
To find the pump that minimizes the distances to all
deaths, you want the index or position of
the minimum value in distances (i.e. an integer from 1-8).
Try to get this on your own by googling, or see the A5hints.txt
file.
# Find the index of the minimum element of distances
closest_pump_idx = which(distances == min(distances))
What is the value of closest_pump_idx? What
information does this give you?
— Your description here —
Having quantitatively found which pump is the probable
cause for the outbreak, you and Snow prepare a visualization for the
Inter-dimensional Council. The chunk below will plot deaths
and pumps on top of Snow’s original map, with an additional
marker for the pump you’ve identified as the culprit.
Customize it however you like!
ggplot() +
geom_raster(data = map, aes(x = x, y = y, fill = SnowMap_1)) + # first 2 lines draw Snow's original map
scale_fill_gradient(low = 'black', high = 'white') + # makes sure map color is black & white
geom_sf(data = deaths, aes(size = Count * 1.1), shape = 21, color = 'black', fill = '#77a35b') +
geom_sf(data = pumps, color = 'black', fill = '#7af5f1', size = 4, shape = 25) +
geom_sf(data = pumps[closest_pump_idx,],
color = 'black', fill = '#ff54ee', size = 5, shape = 25)
Congratulations! By locating the pump, you’ve ended an
epidemic, saved countless lives, and sky-rocketed John Snow’s career.
You step into the time portal to return home, knit this file to HTML,
and submit to Courseworks.
On a scale of 1-5, with 1 being “too easy”, 5 being “too hard”, and 3 being “a good level”, how difficult was this assignment?
Approximately how much time did you spend working on this assignment? (i.e. actively solving the problem, not exploring the data independently)
Any thoughts on what you found useful, or what you found mundane, or confusing? (anything, big or small)